Customer Relationship Management (CRM) plays a crucial role in marketing strategy by providing organizations with the business intelligence needed to build, manage, and develop valuable long-term customer relationships. Many businesses have come to realize the significance of CRM and of applying technical expertise to achieve competitive advantage.
This problem centers on AllLife Bank's credit card customer base, and the bank would like to accomplish two goals. First, AllLife Bank seeks to improve its market share of credit card customers. More specifically, the marketing team proposes to run personalized campaigns to target new customers, as well as to generate more revenue from existing customers. Second, AllLife Bank would like to upgrade its service delivery model to ensure more timely problem resolution given market feedback that customers negatively perceive its credit card support services.
To advise AllLife Bank, we will use one of the most useful techniques in business analytics for analyzing and categorizing consumer behavior: customer segmentation. We will use clustering techniques, grouping customers with similar means, ends, and behavior into homogeneous clusters. Customer segmentation should allow us to advise AllLife Bank on marketing strategies by revealing distinct groups of customers who think and function differently and follow varied approaches in their spending and purchasing habits. Clustering techniques should reveal groups that are internally homogeneous and externally heterogeneous.
Since customers vary in terms of behavior, needs, wants, and characteristics, our main goal in using clustering techniques will be to identify different customer types and segment the customer base into clusters of similar profiles. These segmented profiles will allow us to develop target marketing that can be executed more efficiently for AllLife Bank's two goals.
Data Set Attribute Information:
Set forth below is the data of various customers of AllLife Bank for our analysis, including their credit limit, the total number of credit cards the customer holds, and the different channels through which the customer has contacted AllLife Bank for queries, i.e., the means by which they engaged AllLife Bank for support services (visiting the bank, online, and by phone).
Prior to using the data to train the machine learning models, we must first analyze and preprocess the data. As specifically required by the assignment, we will conduct the following analysis: (i) typical univariate analysis, including but not limited to analysis of the customer variables' distributions/tails, missing values, outliers, and duplicates, and (ii) exploratory data analysis, creating visualizations to explore the data (10 marks). Much of this analysis will be accomplished by an initial review of a pandas profiling report. We will also comment our code, explain the steps taken, and provide insights from our data analysis.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import itertools
import numpy as np
import os
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn import cluster
from sklearn.cluster import SpectralClustering
read_file = pd.read_excel('Credit Card Customer Data.xlsx')
read_file.to_csv('Credit Card Customer Data.csv', index=None, header=True)
df = pd.read_csv('Credit Card Customer Data.csv')
df_copy=df.copy()
df.head()
df.info()
df.isnull().sum()
#Dropping any duplicates
df = df.drop_duplicates()
df.shape
df.describe().transpose()
Initial Insights
As can be seen from an initial review, all the sample data is populated and there is no need to determine how to handle missing data. The data also appears to have appropriate data types for our analysis. There were no duplicate rows. I now want to explore the data through univariate and bivariate analysis.
As always, I find a review of the data in a pandas profiling report helpful.
from pandas_profiling import ProfileReport
df.profile_report()
Initial Observations: We notice that there are 5 Customer Key numbers that have two entries, so we want to look at this data more closely. There are also a significant number of zeroes for some of the data, but this is not surprising in terms of some customers displaying certain behaviors while not displaying other behaviors.
df[df["Customer Key"] == 47437]
df[df["Customer Key"] == 37252]
df[df["Customer Key"] == 97935]
df[df["Customer Key"] == 96929]
df[df["Customer Key"] == 50706]
Additional Observations
For ease of reference, after addressing the issues outlined above for the Sl_No and Customer Key variables, we will create a pairplot and heatmap to explore and analyze the data.
# Getting rid of the earlier customer records
df = df[df["Sl_No"] != 5]
df = df[df["Sl_No"] != 49]
df = df[df["Sl_No"] != 105]
df = df[df["Sl_No"] != 392]
df = df[df["Sl_No"] != 412]
#Convert the number of credit cards held by customer into dummy variables
#(This is subject to business knowledge, and number of credit cards is usually important in banking.)
one_hot = pd.get_dummies(df['Total_Credit_Cards'])
one_hot = one_hot.add_prefix('CC_')
# merge in main data frame
df = df.join(one_hot)
df.head()
#Dropping columns Sl_No and Customer Key. Also dropping Total_Credit_Cards since it is covered by the dummy variables
df = df.drop(columns='Sl_No')
df = df.drop(columns='Total_Credit_Cards')
df = df.drop(columns='Customer Key')
df.head()
Let's take another look at our treated variables and their correlations.
sns.pairplot(df)
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="YlGnBu")
plt.show()
Observations of Correlations: Assuming correlation values > 0.3 are significant among the variables, one can observe the following:
Now we will review the data for outliers.
df["1-3CCards"] = df["CC_1"] + df["CC_2"] + df["CC_3"]
df["4-7CCards"] = df["CC_4"] + df["CC_5"] + df["CC_6"] + df["CC_7"]
df["8-10CCards"] = df["CC_8"] + df["CC_9"] + df["CC_10"]
df.head()
df = df.drop(columns='CC_1')
df = df.drop(columns='CC_2')
df = df.drop(columns='CC_3')
df = df.drop(columns='CC_4')
df = df.drop(columns='CC_5')
df = df.drop(columns='CC_6')
df = df.drop(columns='CC_7')
df = df.drop(columns='CC_8')
df = df.drop(columns='CC_9')
df = df.drop(columns='CC_10')
df.head()
plt.figure(figsize=(12,6))
sns.boxplot(data=df, orient="h", palette="Set2", dodge=False)
Action Note: Treat Outliers. We will treat the Avg_Credit_Limit for outliers.
# Let us take the logarithmic transform of Avg_Credit_Limit to reduce the impact of outliers
df['Avg_Credit_Limit'] = np.log(df['Avg_Credit_Limit'])
#Confirming treatment of outliers
sns.boxplot(df['Avg_Credit_Limit'])
As stated above, we will standardize our data set (via z-scores) to prepare the data for our machine learning tools.
from scipy.stats import zscore
df_std = df.apply(zscore)
# Use an independent copy for hierarchical clustering so that columns added to
# df_std later (e.g. the K-Means labels) do not leak into the hierarchical analysis
df_hc = df_std.copy()
df_std.head()
K-Means Clustering
K-Means is one of the most widely used clustering algorithms, being both simple and efficient. The aim of the K-Means algorithm is to divide M points in N dimensions into K clusters (with K centroids) fixed a priori. These centroids should be placed wisely, because different starting locations can produce different results; ideally, they are placed as far from each other as possible. Each data point is then taken and associated with its nearest centroid until no data points are pending. This completes an early grouping, at which point K new centroids must be recalculated as the centers of the clusters just formed. After these centroids are calculated, the data points are reallocated to the clusters with the nearest centroids. With each iteration, the centroids shift stepwise until no further modifications are needed and the locations of the centroids remain fixed.
The K-Means algorithm is relatively simple. The K cluster points, which will be the centroids, are placed in the space among the data points. Each data point is assigned to the centroid to which its distance is least. After every data point has been assigned, the centroids of the new groups are recalculated. These two steps are repeated until the movement of the centroids ceases. At that point the objective function of least squared error cannot be improved further, and we obtain K clusters as a result.
The K-Means algorithm thus aims at minimizing an objective function measured by the squared error, an indicator of the distance of the data points from their respective cluster centers. The process always terminates, but an optimal configuration cannot be guaranteed even when the stopping condition on the objective function is met. The algorithm is also sensitive to the selection of the initial random cluster centers.
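The assign/recalculate loop described above can be sketched in a few lines of NumPy (a toy illustration with a naive deterministic initialization of our own choosing, not a substitute for sklearn's optimized KMeans):

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100):
    """Toy K-Means: repeat the assignment and centroid-update steps until convergence."""
    # Naive deterministic initialization: spread starting centroids across the data
    # indices (real implementations use random restarts or k-means++ seeding)
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: the squared-error objective can no longer improve
        centroids = new_centroids
    return labels, centroids
```

In practice we rely on sklearn's KMeans, which adds smarter seeding (k-means++) and multiple restarts (n_init) to mitigate the sensitivity to initial centers noted above.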
Metrics. Sum of Squares within Cluster (SSWC) is the simplest and most widely used criterion to gauge the validity of the clusters; smaller values of SSWC mean better clusters. We will review this measurement using the elbow plot. The Silhouette score is another measure of validity. We will also seek to visualize the data in an effort to understand the clusters of segmented customers.
Sum_of_squared_distances = []
K = range(1, 7)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_std)
    Sum_of_squared_distances.append(km.inertia_)
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
In the plot above the elbow is at k=3 indicating the optimal k for this dataset is 3.
Now let us review the Silhouette scores.
silhouette_scores = []
for n_cluster in range(2, 7):
    silhouette_scores.append(
        silhouette_score(df_std, KMeans(n_clusters=n_cluster).fit_predict(df_std)))
# Plotting a bar graph to compare the results
k = [2, 3, 4, 5, 6]
plt.bar(k, silhouette_scores)
plt.xlabel('Number of clusters', fontsize = 10)
plt.ylabel('Silhouette Score', fontsize = 10)
plt.show()
The Silhouette score confirms that the optimal number of clusters is 3.
#Setting the value of k=3 for our K-Means Clustering
kmeans = KMeans(n_clusters=3, n_init = 7, random_state=2345)
kmeans.fit(df_std)
centroids = kmeans.cluster_centers_
centroids
#Calculating the centroids for the columns to profile
centroid_df = pd.DataFrame(centroids, columns = list(df_std) )
print(centroid_df)
## Creating new dataframe only for labels and converting it into categorical variable
df_labels = pd.DataFrame(kmeans.labels_ , columns = list(['labels']))
df_labels['labels'] = df_labels['labels'].astype('category')
# Joining the label dataframe with the data frame.
df_labeled = df.join(df_labels)
df_analysis = df_labeled.groupby('labels').head(4177)
df_analysis
df_labeled['labels'].value_counts()
## 3D plots of clusters
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(8, 6))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=20, azim=60)
k3_model=KMeans(3)
k3_model.fit(df_std)
labels = k3_model.labels_
ax.scatter(df_std.iloc[:, 0], df_std.iloc[:, 1], df_std.iloc[:, 2], c=labels.astype(float), edgecolor='k')
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Avg_Credit_Limit')
ax.set_ylabel('Total_visits_bank')
ax.set_zlabel('Total_visits_online')
ax.set_title('3D plot of KMeans Clustering')
Observation: Viewed in the 3D plot, the customer segments are clear, with only a few data points mixed in with the distinct groups. These seem like mostly consistent clusters.
final_model=KMeans(3)
final_model.fit(df_std)
prediction=final_model.predict(df_std)
#Append the prediction
df_std["KCluster"] = prediction
print("KCluster Assigned : \n")
df_std[["Avg_Credit_Limit", "KCluster"]]
df_std.boxplot(by = 'KCluster', layout=(8,2), figsize=(15, 20))
Observations of K-Means Clustering
Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis which builds a hierarchy of data points as they move into a cluster or out of it. Strategies for this algorithm generally fall into two categories: agglomerative and divisive. Agglomerative is a bottom-up approach by which each observation begins as an initial cluster and then merges into clusters as they move up the hierarchy. The divisive technique is a top-down approach where there is only one cluster initially and is then split into finer cluster groups as they move down the hierarchy. This merging and splitting of clusters takes place in a greedy manner and the hierarchical algorithm yields a dendrogram which represents the nested grouping of patterns and the levels at which groupings change.
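As a quick illustration on toy data of our own making, scipy's `linkage` performs the agglomerative (bottom-up) merging and `fcluster` cuts the resulting hierarchy into flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two obvious groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Bottom-up merging with Ward linkage; Z records which clusters
# merged at each step and the distance at which they merged
Z = linkage(X, method='ward')

# Cut the hierarchy to obtain two flat clusters (labels are 1-based)
flat = fcluster(Z, t=2, criterion='maxclust')
print(flat)  # the two spatial groups receive two distinct labels
```

The same Z matrix feeds `dendrogram()`, which is how the nested groupings are visualized.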
We are going to execute hierarchical clustering (with different linkages, including Ward, average, and complete) with the assistance of Silhouette scores to review all of the different combinations of clusters (2:7) and linkage methodologies. In terms of the different linkages we will review: "average linkage" means the distances between all members of one cluster and all members of a different cluster are calculated, and the average of these distances is then used to decide which clusters will merge. "Complete linkage" means the distances between the most dissimilar members for each pair of clusters are calculated, and clusters are then merged based on the shortest of these distances. "Ward linkage" uses an analysis-of-variance approach to determine the distance between clusters. There are other linkage methods, but we are not seeking to be exhaustive in our hierarchical clustering analysis.
We will then select the highest scores to analyze using dendrograms and the cophenetic coefficient (or more precisely, the cophenetic correlation coefficient), which is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. We note that we could use dendrograms and calculate the cophenetic coefficient for every number of clusters and every linkage method, but that is not time efficient for this exercise.
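On a small example (toy data again), the cophenetic correlation coefficient is obtained by correlating the original pairwise distances with the cophenetic distances, i.e. the merge heights at which each pair of points first joins the same cluster:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Two tight pairs of points, far away from each other
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
Z = linkage(X, method='average')

# c is the correlation between the original pairwise distances
# and the dendrogram's cophenetic distances
c, coph_dists = cophenet(Z, pdist(X))
print(round(c, 3))  # close to 1: the dendrogram preserves the distances well
```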
We will finally analyze the customer segment clusters formed by hierarchical clustering using boxplots.
from scipy.spatial.distance import pdist #Pairwise distribution between data points
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
silhouette_list_hierarchical = []
for cluster in range(2, 7):
    for linkage_method in ['ward', 'average', 'complete']:
        agglomerative = AgglomerativeClustering(linkage=linkage_method, affinity='euclidean', n_clusters=cluster).fit_predict(df_hc)
        sil_score = metrics.silhouette_score(df_hc, agglomerative, metric='euclidean')
        silhouette_list_hierarchical.append((cluster, sil_score, linkage_method, len(set(agglomerative))))
df_hierarchical = pd.DataFrame(silhouette_list_hierarchical, columns=['cluster', 'sil_score', 'linkage_method', 'number_of_clusters'])
df_hierarchical.sort_values('sil_score', ascending=False)
Based on these results, we will choose some linkage methods and cluster numbers for further analysis using dendrograms and calculating cophenetic correlation coefficients.
#Review "Average" Methodology with different clusters
Z = linkage(df_hc, 'average', metric='euclidean')
Z.shape
# cophenetic correlation coefficient - the closer it is to 1, the better the clustering
c, coph_dists = cophenet(Z, pdist(df_hc))
c
This is close to 1.
plt.figure(figsize=(25, 10))
dendrogram(Z)
plt.show()
plt.figure(figsize=(25, 10))
dendrogram(
Z,
truncate_mode='lastp', # show only the last p merged clusters
p=3, # show only the last p merged clusters
)
plt.show()
This is good, but the one thing that bothers me in my review of the dendrogram is that there is one cluster with very few customers (8). Let's see what results when we truncate with a p value of 4.
plt.figure(figsize=(25, 10))
dendrogram(
Z,
truncate_mode='lastp', # show only the last p merged clusters
p=4, # show only the last p merged clusters
)
plt.show()
It is unclear whether this additional customer segment cluster is helpful, but I like the more balanced clusters. Let's check with a value of 5 clusters.
plt.figure(figsize=(25, 10))
dendrogram(
Z,
truncate_mode='lastp', # show only the last p merged clusters
p=5, # show only the last p merged clusters
)
plt.show()
A p value of 5 did not help, as it created a customer segment with only 1 customer.
#Review "Ward" Methodology with different clusters
Z = linkage(df_hc, 'ward', metric='euclidean')
Z.shape
c, coph_dists = cophenet(Z, pdist(df_hc))
c
Not as good a score as "average" linkage, but not bad either.
plt.figure(figsize=(25, 10))
dendrogram(Z)
plt.show()
plt.figure(figsize=(25, 10))
dendrogram(
Z,
truncate_mode='lastp',
p=3,
)
plt.show()
plt.figure(figsize=(25, 10))
dendrogram(
Z,
truncate_mode='lastp',
p=4,
)
plt.show()
plt.figure(figsize=(25, 10))
dendrogram(
Z,
truncate_mode='lastp',
p=5,
)
plt.show()
Observations: I am new to this, but these dendrograms are yielding surprising results. While the cophenetic score is lower for the "Ward" linkage hierarchical clustering method, the spread and balance of the 4- and 5-cluster cuts is remarkable. I prefer the five clusters.
#Review "Complete" Methodology with different clusters
Z = linkage(df_hc, 'complete', metric='euclidean')
Z.shape
c, coph_dists = cophenet(Z, pdist(df_hc))
c
A high cophenetic correlation score as well.
plt.figure(figsize=(25, 10))
dendrogram(Z)
plt.show()
plt.figure(figsize=(25, 10))
dendrogram(
Z,
truncate_mode='lastp',
p=3,
)
plt.show()
plt.figure(figsize=(25, 10))
dendrogram(
Z,
truncate_mode='lastp',
p=4,
)
plt.show()
plt.figure(figsize=(25, 10))
dendrogram(
Z,
truncate_mode='lastp',
p=5,
)
plt.show()
Observations: These results are similar to our analysis using the "average" linkage method.
Based upon our analysis, 3 clusters with the average linkage method gives us the best overall scores using the Silhouette and cophenetic correlation coefficients. However, it is not clear to me how important the difference in cophenetic scores between the average and Ward linkages is, and the customer segment distributions created by the Ward linkage method look superior when reviewing the dendrograms. I will move forward using the Ward linkage method with 5 clusters.
Z = linkage(df_hc, metric='euclidean', method='ward')
from scipy.cluster.hierarchy import cut_tree
HC_cluster_labels = cut_tree(Z, n_clusters=5).reshape(-1, )
HC_cluster_labels
df_hc["Hierarchical_Cluster_labels"] = HC_cluster_labels
df_hc.head()
df_hc['Hierarchical_Cluster_labels'].value_counts()
df_hc.boxplot(by='Hierarchical_Cluster_labels', layout = (5,2),figsize=(20,15))
Analysis of the Different Clusters
Discussion of the Different K-Means and Hierarchical Clustering Methods
General Discussion. Due to increasing commercialization, consumer data is increasing exponentially. When dealing with this large magnitude of data, organizations need to make use of more efficient clustering algorithms for customer segmentation. These clustering models need to possess the capability to process this enormous data effectively.
In preparation for this project, I researched the use of K-Means and hierarchical clustering for customer segmentation. I learned that each of the clustering algorithms discussed above comes with its own set of merits and drawbacks. The computational speed of the K-Means clustering algorithm is relatively better than that of hierarchical clustering, as the latter requires the calculation of the full proximity matrix at each iteration. K-Means clustering gives better performance for a large number of observations, while hierarchical clustering is better suited to fewer data points, ostensibly due in part to the difficulty of visualizing a dendrogram with large numbers of data points.
The major hindrance of K-Means clustering is the need to select the number of clusters 'K', which must be provided as an input to this non-hierarchical algorithm. This limitation does not exist in the case of hierarchical clustering, since it does not require any cluster centers as input. Hierarchical clustering also gives better results than K-Means when a random dataset is used. The output obtained from hierarchical clustering takes the form of dendrograms, while the output of K-Means consists of flat-structured clusters, which may be more difficult to analyze. As the value of 'K' increases, the quality (accuracy) of hierarchical clustering improves when compared to K-Means clustering. As such, partitioning algorithms like K-Means are suitable for large datasets, while hierarchical clustering algorithms are more suitable for small datasets. Given my lack of experience, I do not know whether this dataset is considered large, but I suspect it is not.
Another takeaway from my research is that both K-Means and hierarchical clustering have drawbacks that make them less suitable when used individually. For business use, including in developing marketing and other business strategies, data visualization forms a major part of efficient data analysis, and hierarchical clustering aids in doing so. However, when the performance aspect is taken into account, K-Means tends to deliver better results. With the advantages and disadvantages of the two techniques highlighted, it leaves me wondering whether combining these two clustering methodologies could outperform the individual machine learning models.
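One common form of this combination can be sketched as follows (the helper name and workflow are our own illustration, not an established API): use hierarchical (Ward) cuts to choose a promising number of clusters, then refine the final partition with K-Means, which scales better and directly optimizes the within-cluster squared error:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def hybrid_cluster(X, k_range=range(2, 7), random_state=0):
    """Choose k via silhouette scores on hierarchical (Ward) cuts, then refine with K-Means."""
    Z = linkage(X, method='ward')
    best_k, best_score = None, -1.0
    for k in k_range:
        labels = fcluster(Z, t=k, criterion='maxclust')
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    # Refine the chosen k with K-Means (k-means++ seeding, multiple restarts)
    km = KMeans(n_clusters=best_k, n_init=10, random_state=random_state).fit(X)
    return best_k, km.labels_
```

This mirrors what we did manually above: hierarchical results guided the choice of cluster count, while a partitioning algorithm produces the final flat labels.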
Now let us take a look at several of the variables in the dataset, comparing the customer segments (clusters) created by the K-Means clustering analysis with those created by the hierarchical clustering analysis.
# scatter plots of selected feature pairs to observe the cluster distribution
plt.figure(figsize=(12,6),dpi=200)
plt.subplot(1,2,1)
sns.scatterplot(x='Avg_Credit_Limit' , y='Total_visits_online',data=df_std,hue='KCluster')
plt.subplot(1,2,2)
sns.scatterplot(x='Avg_Credit_Limit', y='Total_visits_online',data=df_hc,hue='Hierarchical_Cluster_labels')
# scatter plots of selected feature pairs to observe the cluster distribution
plt.figure(figsize=(12,6),dpi=200)
plt.subplot(1,2,1)
sns.scatterplot(x='Avg_Credit_Limit' , y='Total_visits_bank',data=df_std,hue='KCluster')
plt.subplot(1,2,2)
sns.scatterplot(x='Avg_Credit_Limit', y='Total_visits_bank',data=df_hc,hue='Hierarchical_Cluster_labels')
# scatter plots of selected feature pairs to observe the cluster distribution
plt.figure(figsize=(12,6),dpi=200)
plt.subplot(1,2,1)
sns.scatterplot(x='Avg_Credit_Limit' , y='Total_calls_made',data=df_std,hue='KCluster')
plt.subplot(1,2,2)
sns.scatterplot(x='Avg_Credit_Limit', y='Total_calls_made',data=df_hc,hue='Hierarchical_Cluster_labels')
Discussion of the Different K-Means and Hierarchical Clustering Methods Continued
It is difficult to get a feel for the different clusters generated by the K-Means and hierarchical clustering techniques from a graphical review alone. But in advising AllLife Bank on our two engagements (running personalized campaigns to target new customers, and upgrading its service delivery model to ensure timely problem resolution), the hierarchical clustering results seemed better.
While the Silhouette scores for the K-Means clustering results are higher than for the hierarchical clustering results, and while higher Silhouette and cophenetic scores are available among the different hierarchical linkage methodologies (and numbers of clusters), data visualization forms a major part of efficient data analysis, and hierarchical clustering aids in doing so. The value of viewing the dendrograms and box plots for our chosen hierarchical clustering ("Ward" linkage with 5 clusters) is evident from our ability to infer more about the customer segments and to think about strategy for advising AllLife Bank on its problems.
Let's also recall the three general questions posed by our problem set. First, how many different segments of customers are there? Second, how are these segments different from each other? Third, what are your recommendations to the bank on how to better market to and service these customers?
We have covered the first two questions throughout our observations above. Let's focus on recommendations, however, using our preferred Hierarchical Clustering result. Here are some thoughts on recommendations: